Abstract

We address the asymptotic and approximate distributions of a large class of test statistics with 
quadratic forms used in association studies. The statistics of interest do not necessarily follow a 
chi-square distribution and take the general form $D = X^T A X$, where $X$ follows the multivariate
normal distribution, and A is a general similarity matrix which may or may not be positive semi- 
definite. We show that D can be written as a linear combination of independent chi-square random 
variables, whose distribution can be approximated by a chi-square or the difference of two chi- 
square distributions. In the setting of association testing, our methods are especially useful in two 
situations. First, for a genome screen, the required significance level is much smaller than 0.05 due 
to multiple comparisons, and estimation of p-values using permutation procedures is particularly 
challenging. An efficient and accurate estimation procedure would therefore be useful. Second, in 
a candidate gene study based on haplotypes when phase is unknown, a computationally expensive
method, the EM algorithm, is usually required to infer haplotype frequencies. Because the EM
algorithm is needed for each permutation, this results in a substantial computational burden, which 
can be eliminated with our mathematical solution. We assess the practical utility of our method 
using extensive simulation studies based on two example statistics and apply it to find the sample 
size needed for a typical candidate gene association study when phase information is not available. 
Our method can be applied to any quadratic form statistic and therefore should be of general 
interest. 

Key words: quadratic form, asymptotic distribution, approximate distribution, weighted chi- 
square, association study, permutation procedure 
Introduction

The multilocus association test is an important tool for use in the genetic dissection of com- 
plex disease. Emerging evidence demonstrates that multiple mutations within a single gene often 
interact to create a "super allele" which is the basis of the association between the trait and the 
genetic locus [Schaid et al., 2002]. For the case-control design, a variety of test statistics have been
applied, such as the likelihood ratio, $\chi^2$ goodness-of-fit, the score test, and the similarity- or
distance-based test. Many of these statistics have the quadratic form $s^T A s$, or are functions of quadratic
forms, where $s$ is a function of the sample proportions of haplotype or genotype categories and $A$
is the similarity or distance matrix. Some of these test statistics follow the chi-square distribution
under the null hypothesis. For those that do not follow the chi-square distribution, the permutation
procedure is often performed to estimate the p-value and power [Sha et al., 2007; Lin et al., 2009].

Previous attempts to find the asymptotic or approximate distribution of this class of statistics 
have been limited or case-specific. Tzeng et al. [2003] advanced our understanding of this area
when they proposed a similarity-based statistic T and demonstrated that it approximately followed 
a normal distribution. The normal approximation works well under the null hypothesis provided 
that the sample sizes in the case and control populations are similar. However, the normal approx- 
imation can be inaccurate when the sample sizes differ, when there are rare haplotypes or when 
the alternative hypothesis is true instead, as we describe later. Schaid et al. [2002] proposed the score
test statistic to assess the association between haplotypes and a wide variety of traits. Assuming
normality of the response variables, this score test statistic can be written as a quadratic form
of normal random variables and follows a chi-square distribution under the null hypothesis. To
calculate power, Schaid [2005] discussed systematically how to find the non-centrality parameters
under the alternative hypothesis. However, their result cannot be applied to the general case when 
a quadratic form statistic does not follow a non-central chi-square distribution. In the power com- 
parisons made by Lin and Schaid [2009], power and p-values were all estimated using permutation 
procedures. However, a permutation procedure is usually not appropriate when the goal is to estimate
a probability close to 0 or 1. For example, if the true probability p is about 0.01, 1,600 permutations
are needed to derive an estimate that is between p/2 and 3p/2 with 95% confidence. The number 
of permutations increases to 160,000 if p is only 0.0001. Consequently, permutation tests are not 
suitable when a high level of significance is being sought. 
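The permutation counts quoted above can be reproduced with a short calculation. The sketch below is illustrative only: it assumes the usual normal approximation to a binomial proportion with a critical value of about 2 for 95% confidence, and the function name `permutations_needed` is hypothetical.

```python
import math

def permutations_needed(p, rel_half_width=0.5, z=2.0):
    """Approximate number of permutations N such that an estimate of the
    probability p falls within p * (1 +/- rel_half_width) with roughly 95%
    confidence.

    Assumption: normal approximation to the binomial, i.e.
    z * sqrt(p * (1 - p) / N) <= rel_half_width * p.
    """
    if not 0.0 < p < 1.0:
        raise ValueError("p must be strictly between 0 and 1")
    return math.ceil((z / rel_half_width) ** 2 * (1.0 - p) / p)

# Roughly 1,600 for p = 0.01 and roughly 160,000 for p = 0.0001.
for p in (0.01, 0.0001):
    print(p, permutations_needed(p))
```

The required number of permutations grows like 16/p, which is why very small significance levels quickly become impractical to reach by permutation.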

The permutation procedure can also be very computationally intensive when estimating power. 
In a typical power analysis, for example, the significance level is 0.05 and power is 0.8. Under 
these assumptions the p-value could be based on 1,000 permutations. Subsequently if the power 
of the test is estimated with 1,000 simulations, the statistic must be calculated 1,000,000 times. 
Moreover, to apply the multilocus association test method to genome-wide studies, the required 
significance level is many orders of magnitude below 0.05 to account for multiple comparisons 
and even 1,000 permutations will be completely inadequate.

Additional complications arise with permutations since most of the data in the current genera- 
tion of association studies are un-phased genotypes. To explore the haplotype-trait association, the 
haplotypes are estimated using methods such as the EM algorithm [Excoffier and Slatkin, 1995;
Hawley and Kidd, 1995] or Bayesian procedures [Stephens and Donnelly, 2003]. Two compu- 
tational problems arise in this situation. First, the resulting haplotype distribution defines a very 
large category because all the haplotypes consistent with the corresponding genotypes will have a 
positive probability. Therefore, the number of rare haplotypes is usually greater than when phase 
is actually observed. Second, the process is again computationally intensive because the haplo- 
type distribution needs to be determined for each permutation. To solve these problems, Sha et al. 
[2007] proposed a strategy where each rare haplotype is merged with its most similar common hap- 
lotype, thereby reducing the number of rare haplotypes and leading to a computationally efficient 
algorithm for the permutation procedure. This method is considerably faster than the standard EM 
algorithm. However, since it is still based on permutations it is not a perfect solution to the com- 
putational problem. Moreover, the process of pruning out rare haplotypes can lead to systematic 
bias in the estimation of haplotype frequencies in some situations. 

Based on these considerations, it is apparent that a fast and accurate way to estimate the corresponding
p-value and associated power would be an important methodological step forward and
make it possible to generalize the applications of these statistics. In this paper, we explore the 
asymptotic and approximate distribution of statistics with quadratic forms. Based on the results 
of these analyses, p-values and power can be estimated directly, eliminating the need for permuta- 
tions. We assess the robustness of our methods using extensive simulation studies. 

To simplify the notation, we use the statistic $S$ proposed by Sha et al. [2007] to illustrate
our methods. We first assume that the similarity matrix $A$ is positive definite.
We then extend this analysis to the case when A is positive semi-definite and the more general 
case assuming symmetry of A only. In the simulation studies, we use qq-plots and distances 
between distributions to explore the performance of our approximate distributions. In addition, we 
examine the accuracy of our approximations at the tails. Likewise, we assess the performance of 
our approximation under the alternative hypothesis by examining the qq-plots, distances, and tail 
probabilities. As an additional example, we apply our method to the statistic T proposed by Tzeng 
et al. [2003] and compare the result with the normal approximation. Finally, we use our method to 
find the sample size needed for a candidate gene association study when linkage phase is unknown. 

Methods 

Assume that there are $k$ distinct haplotypes $(h_1, \ldots, h_k)$ with frequencies $p = (p_1, \ldots, p_k)^T$ in
population 1, and $q = (q_1, \ldots, q_k)^T$ in population 2. To compare $p$ and $q$, we assume that sample
1 and sample 2 are independent and are collected randomly from population 1 and population 2
respectively. Let $n_j$ and $m_j$, $j = 1, \ldots, k$, represent the observed count of haplotype $h_j$ in sample
1 and sample 2 respectively. We use the same notation as in Sha et al. [2007]:

$n = \sum_{i=1}^{k} n_i$ = size of sample 1,
$m = \sum_{i=1}^{k} m_i$ = size of sample 2,
$\hat{p} = (\hat{p}_1, \ldots, \hat{p}_k)^T = (n_1, \ldots, n_k)^T / n$,
$\hat{q} = (\hat{q}_1, \ldots, \hat{q}_k)^T = (m_1, \ldots, m_k)^T / m$,
$a_{ij} = S(h_i, h_j)$ is the similarity score of haplotypes $h_i$ and $h_j$,
$A = (a_{ij})$ is a $k \times k$ similarity matrix.

Let $s = p - q$ and $\hat{s} = \hat{p} - \hat{q}$. Then Sha et al.'s statistic is defined as $S = (\hat{s}^T A \hat{s})/\sigma_0$, where
$\sigma_0$ is the standard deviation of $\hat{s}^T A \hat{s}$ under the null hypothesis. In this paper, we focus on the
distribution of $D_s = \hat{s}^T A \hat{s}$ since $\sigma_0$ is a constant.
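As a concrete illustration of this notation, the following minimal Python sketch computes the quadratic form from two vectors of haplotype counts and a similarity matrix. The function name and the example counts are hypothetical, and the identity matrix is used only as a stand-in for a matching-type similarity measure.

```python
def quadratic_form_statistic(counts1, counts2, A):
    """Return the quadratic form s_hat^T A s_hat, where s_hat is the
    difference between the sample haplotype proportions of the two samples."""
    n, m = sum(counts1), sum(counts2)
    s_hat = [c1 / n - c2 / m for c1, c2 in zip(counts1, counts2)]
    k = len(s_hat)
    return sum(s_hat[i] * A[i][j] * s_hat[j] for i in range(k) for j in range(k))

# Hypothetical example: three haplotypes, identity similarity matrix.
A = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
d_s = quadratic_form_statistic([50, 30, 20], [40, 40, 20], A)
print(d_s)
```

Here the proportion differences are (0.1, -0.1, 0), so the statistic is the sum of their squares.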
Write $D_s$ as a function of independent normal random variables

Assume that the observed haplotypes in sample 1 are independent and identically distributed
(i.i.d.). Then the counts of haplotypes $(n_1, \ldots, n_k)$ follow the multinomial distribution with
parameters $(n; p_1, \ldots, p_k)$. Therefore, $\mu_p = E(\hat{p}) = p$ and $\Sigma_p = \mathrm{Var}(\hat{p}) = (P - pp^T)/n$, where
$P = \mathrm{diag}(p_1, \ldots, p_k)$ is a $k \times k$ diagonal matrix. According to the multivariate central limit theorem,
$\hat{p}$ asymptotically follows a multivariate normal distribution with mean $\mu_p$ and variance $\Sigma_p$ when $n$
is large. A similar conclusion applies to $\hat{q}$ on replacing $p$ with $q$, $P$ with $Q$ and $n$ with $m$.
Since samples 1 and 2 are independent, we conclude that $\hat{s}$ is asymptotically normally
distributed with mean vector $s = p - q$ and variance $\Sigma_s = \Sigma_p + \Sigma_q$.
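The variance matrix of the difference in sample proportions can be assembled directly from the two multinomial covariance matrices. The sketch below is a plain-Python illustration with hypothetical function names and made-up frequencies. It also shows that every row of the resulting matrix sums to zero, which is one way to see that the matrix cannot have full rank.

```python
def multinomial_cov(p, n):
    """Covariance matrix (P - p p^T) / n of the sample proportions."""
    k = len(p)
    return [[((p[i] if i == j else 0.0) - p[i] * p[j]) / n for j in range(k)]
            for i in range(k)]

def sigma_s(p, q, n, m):
    """Variance of p_hat - q_hat for two independent samples."""
    sp, sq = multinomial_cov(p, n), multinomial_cov(q, m)
    k = len(p)
    return [[sp[i][j] + sq[i][j] for j in range(k)] for i in range(k)]

# Hypothetical frequencies for three haplotypes, n = m = 100.
S = sigma_s([0.5, 0.3, 0.2], [0.4, 0.4, 0.2], 100, 100)
row_sums = [sum(row) for row in S]
print(row_sums)  # numerically zero: the vector of ones lies in the null space
```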

Let $r_\sigma$ denote the rank of $\Sigma_s$. Then $r_\sigma \le k - 1$ since $\hat{s} = (\hat{s}_1, \ldots, \hat{s}_k)^T$ only has $k - 1$ free
components. If we assume $p_i + q_i > 0$ for all $i = 1, \ldots, k$, then $r_\sigma = k - 1$. Since $\Sigma_s$ is symmetric
and positive semi-definite, there exist a $k \times k$ orthogonal matrix $U = (u_1, \ldots, u_k)$ and a diagonal
matrix $\Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_{r_\sigma}, 0, \ldots, 0)$ such that $\Sigma_s = U \Lambda U^T$ and $\lambda_1 \ge \cdots \ge \lambda_{r_\sigma} > 0$.

Now define matrices $U_\sigma = (u_1, \ldots, u_{r_\sigma})$, $\Lambda_\sigma = \mathrm{diag}(\lambda_1, \ldots, \lambda_{r_\sigma})$, and $B = U_\sigma \Lambda_\sigma^{1/2}$. Then
$\Sigma_s = U_\sigma \Lambda_\sigma U_\sigma^T = BB^T$ and there exist $r_\sigma$ independent standard normal random variables $Z =
(Z_1, \ldots, Z_{r_\sigma})^T$ such that $\hat{s} \approx BZ + s$ for sufficiently large $n$ and $m$. Then we have

$$
\begin{aligned}
D_s &= \hat{s}^T A \hat{s} \\
    &\approx (BZ + s)^T A (BZ + s) \\
    &= Z^T B^T A B Z + 2 s^T A B Z + s^T A s. \qquad (1)
\end{aligned}
$$

We then write $W = B^T A B = \Lambda_\sigma^{1/2} U_\sigma^T A U_\sigma \Lambda_\sigma^{1/2}$. Since $W$ is an $r_\sigma \times r_\sigma$ symmetric matrix, there
always exist an $r_\sigma \times r_\sigma$ orthogonal matrix $V$ and a diagonal matrix $\Omega = \mathrm{diag}(\omega_1, \ldots, \omega_{r_\sigma})$ such
that $W = V \Omega V^T$, where $\omega_1 \ge \cdots \ge \omega_{r_\sigma}$ are the eigenvalues of $W$.

Asymptotic and approximate distributions of $D_s$ with the assumption $s = 0$

Now let us consider the asymptotic distribution of $D_s$ under the null hypothesis $H_0: p = q$.
That is, $s = 0$. Let $D_0$ represent the test statistic under $H_0$. Then we have $D_0 \approx
Z^T V \Omega V^T Z$. Let $Y = (Y_1, \ldots, Y_{r_\sigma})^T = V^T Z$. Then $Y \sim N(0, I_{r_\sigma})$, where $I_{r_\sigma}$ is the $r_\sigma \times r_\sigma$
identity matrix, and

$$
D_0 \approx Y^T \Omega Y = \sum_{i=1}^{r_\sigma} \omega_i Y_i^2. \qquad (2)
$$

Case I: The similarity matrix $A$ is positive semi-definite

Under this assumption $W$ will also be positive semi-definite. That is, $\omega_1 \ge \cdots \ge \omega_{r_\sigma} \ge 0$. Then
$D_0$ follows a weighted chi-square distribution asymptotically. To calculate the corresponding
p-values efficiently, we could use a chi-square distribution to approximate it.

According to Satorra and Bentler [1994], the distribution of the adjusted statistic $\beta D_0$ can be
approximated by a central chi-square with degrees of freedom $df_0$, where $\beta$ is the scaling parameter
based on the idea of Satterthwaite et al. [1941]. This method is referred to as the 2-cum chi-square
approximation since the parameters $\beta$ and $df_0$ are obtained by matching the first two cumulants
of the weighted chi-square and the chi-square. Specifically, let $\hat{W}$ be a consistent estimator of $W$.
Then

$$
\beta D_0 \sim \chi^2_{df_0}
$$

approximately, where $\beta = \mathrm{tr}(\hat{W})/\mathrm{tr}(\hat{W}^2)$, $df_0 = (\mathrm{tr}(\hat{W}))^2/\mathrm{tr}(\hat{W}^2)$, and $\mathrm{tr}(\cdot)$ is the trace of a matrix.
Note that it is not necessary to form $\hat{W}$ explicitly because $\mathrm{tr}(\hat{W}) = \mathrm{tr}(\hat{B}^T A \hat{B}) = \mathrm{tr}(A \hat{B} \hat{B}^T) = \mathrm{tr}(A \hat{\Sigma}_s)$,
and $\mathrm{tr}(\hat{W}^2) = \mathrm{tr}(\hat{B}^T A \hat{B} \hat{B}^T A \hat{B}) = \mathrm{tr}(A \hat{\Sigma}_s A \hat{\Sigma}_s)$, where $\hat{\Sigma}_s = \hat{B}\hat{B}^T$ is a consistent estimate of $\Sigma_s$.



Assume that the observed value of $D_s$ is $d_s$. Then the p-value can be estimated using the
following formula

$$
\text{p-value} = P_{H_0}(D_0 > d_s) \approx P(\chi^2_{df_0} > \beta d_s). \qquad (3)
$$

Alternatively, assume that the significance level is $\alpha$ and the value $c^*_\alpha$ is the quantile such that
$P(\chi^2_{df_0} > c^*_\alpha) = \alpha$. Then the critical value of $D_s$ to reject $H_0$ at level $\alpha$ is

$$
d^*_\alpha \approx c^*_\alpha / \beta. \qquad (4)
$$

The above formulas indicate that the degrees of freedom $df_0$ and the coefficient $\beta$ of the chi-square
approximation can be calculated directly from the similarity matrix and the variance matrix. This is a
major advantage of the method, since matrix decomposition can be very slow and inaccurate when
the matrix has high dimensionality.
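To make the trace-based calculation concrete, here is a minimal pure-Python sketch of the 2-cum approximation. It is an illustration under stated assumptions, not the authors' R implementation: all function names are hypothetical, and the chi-square tail probability is computed with a textbook series for the regularized incomplete gamma function, which loses accuracy for p-values near machine precision (a statistical library routine would be preferable there).

```python
import math

def matmul(X, Y):
    """Plain-Python matrix product."""
    return [[sum(X[i][t] * Y[t][j] for t in range(len(Y))) for j in range(len(Y[0]))]
            for i in range(len(X))]

def two_cum_parameters(A, Sigma):
    """Return (beta, df0) from tr(A Sigma) and tr((A Sigma)^2);
    no eigen-decomposition is needed."""
    M = matmul(A, Sigma)
    t1 = sum(M[i][i] for i in range(len(M)))
    M2 = matmul(M, M)
    t2 = sum(M2[i][i] for i in range(len(M2)))
    return t1 / t2, t1 * t1 / t2

def chi2_sf(x, df):
    """P(chi-square with df degrees of freedom > x), via the power series
    of the regularized lower incomplete gamma function."""
    if x <= 0:
        return 1.0
    a, z = df / 2.0, x / 2.0
    term = 1.0 / a
    total = term
    n = 1
    while term > total * 1e-16 and n < 100000:
        term *= z / (a + n)
        total += term
        n += 1
    lower = total * math.exp(-z + a * math.log(z) - math.lgamma(a))
    return max(0.0, 1.0 - lower)

def two_cum_p_value(d_s, A, Sigma):
    """Approximate p-value P(chi2_{df0} > beta * d_s)."""
    beta, df0 = two_cum_parameters(A, Sigma)
    return chi2_sf(beta * d_s, df0)

# Sanity check on a case with a known answer: A = Sigma = identity (2 x 2),
# so the quadratic form is exactly chi-square with 2 degrees of freedom.
I2 = [[1.0, 0.0], [0.0, 1.0]]
print(two_cum_parameters(I2, I2), two_cum_p_value(2.0, I2, I2))
```

In the sanity check the scaling factor is 1 and the degrees of freedom are 2, so the approximation reproduces the exact tail probability exp(-1).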

Case II: The similarity matrix A is NOT positive semi-definite 

In the above chi-square approximation, we assume that the similarity matrix A is positive semi- 
definite. However, many similarity matrices do not satisfy this condition. For example, consider 
the length measure of the first 5 haplotypes in Gene I (Table 1 in Sha et al. [2007]). The similarity
between two haplotypes is defined as the maximum length of a common consecutive subsequence.
The eigenvalues of the similarity matrix $A$ are (2.84, 1.21, 0.60, 0.36, -0.015). Therefore, $A$ is not
positive semi-definite.

In this case, formula (2) is still true though formulas (3)-(4) do not necessarily hold. A simple
solution to this general case is to use the Monte Carlo method to estimate the p-value by generating
independent chi-square random variables with known or estimated $\omega_i$. More specifically, assume
that the observed value of the statistic $D_0$ is $d_0$. Run $N$ simulations. For each simulation $t$, $t =
1, \ldots, N$, generate $r_\sigma$ independent standard normal random variables $y_{t1}, \ldots, y_{t r_\sigma}$. Then calculate
$d^0_t = \sum_{j=1}^{r_\sigma} \omega_j y_{tj}^2$. The p-value can be estimated using the proportion of the $d^0_t$ that are greater than
or equal to $d_0$. This method is not as good as the one based on formula (3), which calculates
the p-value directly, although, compared to the permutation procedure, it is computationally much
simpler and faster.
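The Monte Carlo procedure just described can be sketched in a few lines of Python. The function name, the default number of simulations and the example weights are hypothetical choices made for illustration; the weights may be positive or negative.

```python
import random

def monte_carlo_p_value(d0, weights, n_sim=20_000, seed=0):
    """Estimate P(sum_j w_j * Y_j^2 >= d0) for independent standard normal Y_j
    by direct simulation."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sim):
        d = sum(w * rng.gauss(0.0, 1.0) ** 2 for w in weights)
        if d >= d0:
            hits += 1
    return hits / n_sim

# Hypothetical weights, including one negative eigenvalue.
print(monte_carlo_p_value(3.0, [2.0, 1.0, -0.5]))
```

Because each simulation only draws a handful of normal variates, no haplotype inference or re-sampling of the data is involved, which is where the speed advantage over permutation comes from.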

Alternatively, the eigenvalues can be separated into positive and negative groups. With
estimated $\omega_i$, the sum over the positive group can be approximated by a single chi-square random
variable, as can the sum over the negative group. The corresponding p-value based on the difference of
two chi-square random variables may be estimated by the Monte Carlo method or the technique
described in Appendix D, which is used in all of our simulation studies.
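A rough sketch of the grouping idea follows. It matches the first two cumulants within each sign group and then estimates the tail probability of the difference by Monte Carlo; it does not implement the technique of Appendix D, and the function names are hypothetical.

```python
import random

def two_cum_match(weights):
    """Return (beta, df) such that beta * sum(w_i * chi2_1) is approximately
    chi-square with df degrees of freedom (all w_i assumed positive)."""
    t1 = sum(weights)
    t2 = sum(w * w for w in weights)
    return t1 / t2, t1 * t1 / t2

def chi2_difference_p_value(d0, weights, n_sim=20_000, seed=0):
    """Approximate P(D >= d0) where D is modelled as the difference of two
    scaled chi-square variables, one per sign group of the weights."""
    rng = random.Random(seed)
    groups = []
    for sign in (1.0, -1.0):
        g = [sign * w for w in weights if sign * w > 0]
        groups.append(two_cum_match(g) if g else None)

    def draw(params):
        if params is None:
            return 0.0
        beta, df = params
        # A chi-square with df degrees of freedom is Gamma(shape=df/2, scale=2).
        return rng.gammavariate(df / 2.0, 2.0) / beta

    hits = sum(1 for _ in range(n_sim) if draw(groups[0]) - draw(groups[1]) >= d0)
    return hits / n_sim

print(chi2_difference_p_value(3.0, [2.0, 1.0, -0.5]))
```

Only two random draws are needed per simulation regardless of how many eigenvalues there are, at the cost of the additional approximation error introduced by the cumulant matching.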

Asymptotic and approximate distributions of $D_s$ without the assumption $s = 0$

In this section, we would like to find the asymptotic distribution of $D_s$ provided $p$ and $q$ are
known but not necessarily equal. This is a typical situation for power analysis. In this case, the
values of $s = p - q$ and $\Sigma_s = \Sigma_p + \Sigma_q = (P - pp^T)/n + (Q - qq^T)/m$ are both known. Note that
since $\Sigma_s$ is singular, it is not correct to write $D_s = (Z + B^{-1}s)^T B^T A B (Z + B^{-1}s)$ since $B^{-1}$ is
not well defined. Though $B^{-1}$ can be defined as the generalized inverse of $B$, it is impossible to find a
$B^{-1}$ such that $BB^{-1} = I_k$ since the rank of $B$ is at most $k - 1$. Therefore, the following discussion for
the case when $\Sigma_s$ is singular is not as straightforward as that when $\Sigma_s$ is non-singular.

Case I: The similarity matrix $A$ is nonsingular

Then $W = B^T A B = \Lambda_\sigma^{1/2} U_\sigma^T A U_\sigma \Lambda_\sigma^{1/2}$ is nonsingular since $\Lambda_\sigma$ is nonsingular and $\mathrm{rank}(U_\sigma) =
r_\sigma$. So the eigenvalues of $W$ are non-zero. That is, $\omega_1 \ne 0, \ldots, \omega_{r_\sigma} \ne 0$. Therefore, $\Omega^{-1} =
\mathrm{diag}(1/\omega_1, \ldots, 1/\omega_{r_\sigma})$ is well-defined. Let

$$
\begin{aligned}
b &= \Omega^{-1} V^T \Lambda_\sigma^{1/2} U_\sigma^T A s, \\
c &= s^T A s - b^T \Omega b. \qquad (5)
\end{aligned}
$$
Starting from equation (1), the statistic $D_s$ can be written as (see Appendix A for proof)

$$
D_s \approx (Y + b)^T \Omega (Y + b) + c = \sum_{i=1}^{r_\sigma} \omega_i (Y_i + b_i)^2 + c, \qquad (6)
$$

where $Y$ follows the multivariate standard normal distribution. Provided that the similarity matrix
$A$ is positive definite, $W$ will also be positive definite, and we may assume that $\omega_1 \ge \cdots \ge \omega_{r_\sigma} >
0$. In this case, a non-central shifted chi-square distribution can be used for approximation. Note
that when $\Sigma_s$ is non-singular, this is a special case of formula (6) with $r_\sigma = k$, $U_\sigma = U$, and $\Lambda_\sigma = \Lambda$.
In this case, it is easy to verify that $c = s^T A s - b^T \Omega b = 0$.
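The step from equation (1) to equation (6) is a completing-the-square argument. The outline below is a reconstruction from the definitions of $W$, $V$, $\Omega$, $b$ and $c$ given in the text, offered as a reading aid rather than as a replacement for the proof in Appendix A:

```latex
\begin{aligned}
D_s &\approx Z^T W Z + 2 s^T A B Z + s^T A s
      && (W = B^T A B = V \Omega V^T) \\
    &= Y^T \Omega Y + 2 (V^T B^T A s)^T Y + s^T A s
      && (Y = V^T Z,\ Z = V Y) \\
    &= Y^T \Omega Y + 2 b^T \Omega Y + s^T A s
      && (\Omega b = V^T B^T A s) \\
    &= (Y + b)^T \Omega (Y + b) + s^T A s - b^T \Omega b \\
    &= \sum_{i=1}^{r_\sigma} \omega_i (Y_i + b_i)^2 + c .
\end{aligned}
```

The third line is where the invertibility of $\Omega$ is needed: without it, the linear term could not in general be absorbed into the quadratic one.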

Liu et al. [2009] proposed a non-central shifted chi-square approximation for quadratic form 
$D = X^T A X$ by fitting the first four cumulants of $D$, where $A$ is positive semi-definite. In their
setting, $X$ follows a multivariate normal distribution with a non-singular variance matrix. However,
in our case, the rank of the variance matrix $\Sigma_s$ is at most $k - 1$. Following the idea of Liu
et al. [2009], we are able to derive the corresponding formula to fit our case (see Appendix B for
details). Here we only define the necessary notation and list the final formula. This method is
referred to as the 4-cum chi-square approximation.

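Following Liu et al. [2009], define $\kappa_v = 2^{v-1} (v - 1)! \, (\mathrm{tr}((A\Sigma_s)^v) + v \, s^T (A\Sigma_s)^{v-1} A s)$, $v =
1, 2, 3, 4$. Then let $s_1 = \kappa_3^2/(8\kappa_2^3)$ and $s_2 = \kappa_4/(12\kappa_2^2)$. If $s_1 \le s_2$, let $\delta = 0$ and $df_a = 1/s_1$.
Otherwise, define $\xi = 1/(\sqrt{s_1} - \sqrt{s_1 - s_2})$ and let $\delta = \xi^2(\xi\sqrt{s_1} - 1)$ and $df_a = \xi^2(3 - 2\xi\sqrt{s_1})$.
Now let $\beta_1 = \sqrt{2(df_a + 2\delta)/\kappa_2}$, and $\beta_2 = df_a + \delta - \beta_1 \kappa_1$. Then

$$
\beta_1 D_s + \beta_2 \sim \chi^2_{df_a}(\delta)
$$

approximately. Let $d^*_\alpha$ be the critical value as defined in equation (4). Then the power to reject $H_0$ at significance
level $\alpha$ can be estimated using the following formula:

$$
\text{power} = P_{H_a}(D_s > d^*_\alpha) \approx P(\chi^2_{df_a}(\delta) > \beta_1 d^*_\alpha + \beta_2). \qquad (7)
$$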
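The parameters of the 4-cum approximation depend only on traces and quadratic forms of powers of the product of the similarity and variance matrices, so they can be computed without any eigen-decomposition. The pure-Python sketch below follows the cumulant-matching recipe described in the text; the function names and the returned dictionary keys are hypothetical. Evaluating the final tail probability additionally requires a non-central chi-square distribution function from a statistics library, which is not included here.

```python
import math

def matmul(X, Y):
    """Plain-Python matrix product."""
    return [[sum(X[i][t] * Y[t][j] for t in range(len(Y))) for j in range(len(Y[0]))]
            for i in range(len(X))]

def four_cum_parameters(A, Sigma, s):
    """Return the 4-cum parameters: degrees of freedom, non-centrality,
    and the linear map (beta1, beta2) applied to the statistic."""
    k = len(s)
    M = matmul(A, Sigma)                                   # A * Sigma
    As = [sum(A[i][j] * s[j] for j in range(k)) for i in range(k)]
    power = [[float(i == j) for j in range(k)] for i in range(k)]  # (A Sigma)^(v-1)
    kappa = []
    for v in range(1, 5):
        vec = [sum(power[i][j] * As[j] for j in range(k)) for i in range(k)]
        quad = sum(s[i] * vec[i] for i in range(k))        # s^T (A Sigma)^(v-1) A s
        power = matmul(power, M)                           # now (A Sigma)^v
        tr = sum(power[i][i] for i in range(k))
        kappa.append(2 ** (v - 1) * math.factorial(v - 1) * (tr + v * quad))
    k1, k2, k3, k4 = kappa
    s1 = k3 ** 2 / (8.0 * k2 ** 3)
    s2 = k4 / (12.0 * k2 ** 2)
    if s1 <= s2:
        delta, df = 0.0, 1.0 / s1
    else:
        xi = 1.0 / (math.sqrt(s1) - math.sqrt(s1 - s2))
        delta = xi ** 2 * (xi * math.sqrt(s1) - 1.0)
        df = xi ** 2 * (3.0 - 2.0 * xi * math.sqrt(s1))
    beta1 = math.sqrt(2.0 * (df + 2.0 * delta) / k2)
    beta2 = df + delta - beta1 * k1
    return {"df": df, "delta": delta, "beta1": beta1, "beta2": beta2}

# Sanity check: a single standard normal shifted by 1, squared, is exactly
# non-central chi-square with 1 degree of freedom and non-centrality 1.
params = four_cum_parameters([[1.0]], [[1.0]], [1.0])
print(params)
```

In the sanity check the recipe recovers the exact distribution (one degree of freedom, non-centrality 1, no rescaling), which is a useful test of any implementation before applying it to real similarity matrices.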

Note that this 4-cum approximation is applicable not only under $H_a$, but also under $H_0$. Therefore,
it can be used to find the p-value or define the critical value for rejection. Under $H_0$, the true
haplotype frequencies $p$ and $q$ are usually unknown, although the difference $s = p - q$ is assumed
to be zero. Therefore, to find the corresponding $\beta_1$ and $\beta_2$, we can use $0$ to replace $s$ and $\hat{\Sigma}_s$ to
replace $\Sigma_s$. Then the p-value is estimated as

$$
\text{p-value} = P_{H_0}(D_s > d_s) \approx P(\chi^2_{df_a}(\delta) > \beta_1 d_s + \beta_2), \qquad (8)
$$

or alternatively, the critical value for rejection is

$$
d^*_\alpha \approx (c^*_\alpha - \beta_2)/\beta_1,
$$

where $c^*_\alpha$ is the quantile such that $P(\chi^2_{df_a}(\delta) \ge c^*_\alpha) = \alpha$. Note that $\delta$ is automatically 0 if $s =
0$. To prove this, it is sufficient to prove that $s_1 \le s_2$, which is equivalent to $[\mathrm{tr}((A\Sigma_s)^3)]^2 \le
[\mathrm{tr}((A\Sigma_s)^2)][\mathrm{tr}((A\Sigma_s)^4)]$, which is a direct conclusion from Yang et al. [2001].

If $A$ has negative eigenvalues, the approximations in formulas (7) and (8) are not valid. However,
equation (6) is still true. In this case, we can use the same strategy as discussed in the case
assuming $s = 0$ to estimate the power or p-value.

Case II: The similarity matrix $A$ is singular

If $A$ is singular, that is, $\mathrm{rank}(A) = r_a < k$, there exist an orthogonal matrix $G = (g_1, \ldots, g_k)$ and a
diagonal matrix $\Gamma = \mathrm{diag}(\gamma_1, \ldots, \gamma_{r_a}, 0, \ldots, 0)$, where $\gamma_1 \ne 0, \ldots, \gamma_{r_a} \ne 0$, such that $A = G \Gamma G^T$.
Let $G_a = (g_1, \ldots, g_{r_a})$ and $\Gamma_a = \mathrm{diag}(\gamma_1, \ldots, \gamma_{r_a})$. Then $A$ can be written as $A = G_a \Gamma_a G_a^T$. Now
define $\hat{s}_a = G_a^T \hat{s}$. We have $D_s = \hat{s}^T A \hat{s} = \hat{s}_a^T \Gamma_a \hat{s}_a$, where $\Gamma_a$ is nonsingular and $\hat{s}_a$ asymptotically
follows a normal distribution with mean $\mu_a = G_a^T s$ and variance $\Sigma_a = G_a^T \Sigma_s G_a$. Therefore, even
if $A$ is singular, we can perform the above calculation to reduce its dimensionality and convert
it into a non-singular matrix $\Gamma_a$. Then by replacing $s$ with $\mu_a$, $\Sigma_s$ with $\Sigma_a$, and $A$ with $\Gamma_a$, the
discussion presented in Case I applies.

Applications and extensions of our method 

For illustrative purposes, we started the discussion with the statistic $D_s$ proposed by Sha et al. [2007].
In fact, our method can be applied to a much more general statistic $D$, as long as it can be written
as the quadratic form $D = X^T A X$ with $X \sim N_k(\mu_x, \Sigma_x)$ and $A$ being a $k \times k$ symmetric matrix
which is not necessarily positive semi-definite.

When $\Sigma_x$ is nonsingular, the distribution of $D$ is straightforward because $D$ can be written as
$D = (Z + b)^T \Sigma_x^{1/2} A \Sigma_x^{1/2} (Z + b)$, where $Z = (Z_1, \ldots, Z_k)^T$ are i.i.d. standard normal random variables, and
$b = \Sigma_x^{-1/2} \mu_x$ with $\Sigma_x^{1/2}$ being a symmetric matrix with $\Sigma_x^{1/2} \Sigma_x^{1/2} = \Sigma_x$. Then $D$ follows a weighted
non-central chi-square distribution. Moreover, if $\Sigma_x^{1/2} A \Sigma_x^{1/2}$ is idempotent, all the weights will be either 1
or 0. Therefore $D$ will follow a non-central chi-square distribution with degrees of freedom equal
to the rank of $A$. However, when $\Sigma_x$ is singular, the above conclusion does not hold. In this paper,
we not only show why $D$ can be written as a linear combination of chi-square random variables and
how to estimate the corresponding parameter values, but also how to approximate its distribution
using a chi-square or the difference of two chi-squares. To further illustrate the application of our
method, we will discuss two more examples as follows.

First, let us consider the test statistic defined by Tzeng et al. [2003]. To keep the notation
consistent with ours, the form of the statistic is written as $T = D_t/\sigma_0$, where $D_t = \hat{p}^T A \hat{p} -
\hat{q}^T A \hat{q}$ and $\sigma_0$ is the standard deviation of $D_t$ under the null hypothesis. It was claimed that $T$
is approximately distributed as a standard normal under the null hypothesis. However, we found
that the normal approximation can be inappropriate in some situations. Write $D_p = \hat{p}^T A \hat{p}$ and
$D_q = \hat{q}^T A \hat{q}$ and assume that $A$ is positive definite. Then from our previous discussion, $D_p$ and $D_q$
both asymptotically follow a WNS-chi distribution when sample sizes $n$ and $m$ are large. However,
their convergence rates differ when $n$ and $m$ are different. Consequently, the normal approximation can be
inaccurate when $n$ and $m$ are not very large. In fact, a difference in convergence rates is also the
reason that the normal approximation is not applicable under the alternative hypothesis. We
demonstrate this with simulation studies in the Results section.

Next, let us consider the statistic $S$ proposed by Schaid et al. [2002], where $S = (Y -
\bar{Y})^T X [(X - \bar{X})^T (X - \bar{X})]^{-1} X^T (Y - \bar{Y})/\sigma_Y^2$ is defined based on the linear model $Y = \beta_0 + X\beta +
\sigma_Y \epsilon$ with $Y$ being the observed trait value, $X$ being the design matrix, $\beta = (\beta_1, \ldots, \beta_{k-1})$, and
$\epsilon$ being i.i.d. normal. Schaid [2005] assumed that $S$ follows a non-central chi-square distribution
under the alternative hypothesis. Then the paper focused on the calculation of the non-centrality
parameters under different situations of $X$ (genotype, haplotype, or diplotype) and $Y$ (continuous
or case-control). In fact, $S$ can be written as $S = (Y/\sigma_Y)^T A (Y/\sigma_Y)$, where $A = (X - \bar{X})[(X -
\bar{X})^T (X - \bar{X})]^{-1} (X - \bar{X})^T$. Since $A^2 = A$, we conclude that $S$ follows a non-central chi-square distribution
with non-centrality parameter $c = \mu_Y^T A \mu_Y / \sigma_Y^2$. In practice, $c$ can be replaced by its consistent estimate.
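The idempotency claimed for this matrix can be checked in one line. Writing $X_c = X - \bar{X}$ for the centered design matrix (a shorthand introduced here only for readability), the inner factors cancel:

```latex
A^2 = X_c (X_c^T X_c)^{-1} \underbrace{X_c^T X_c \, (X_c^T X_c)^{-1}}_{=\, I} X_c^T
    = X_c (X_c^T X_c)^{-1} X_c^T = A .
```

Since $A$ is also symmetric, it is an orthogonal projection, so its eigenvalues are all 0 or 1. This is the situation in which every weight in the weighted chi-square representation equals 0 or 1 and the quadratic form reduces to a single (non-central) chi-square variable.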

Software 

We have integrated our approaches in an R source file quadrtic.approx.R. Given the mean
$\mu_x$ and variance $\Sigma_x$ of $X$, this R file contains the subroutines to estimate (1) the probability $p =
P\{X^T A X < d\}$ for a specific $d$, which is useful in approximating p-values or power; (2) the
quantile $d^*$ such that $\alpha = P\{X^T A X < d^*\}$ for a specific $\alpha$; and (3) the required sample size for a
specific level of significance $\alpha$ and power $\beta$. This R file, as well as the readme and data files, can
be downloaded from http://webpages.math.luc.edu/~ltong/software/

Results 

In the simulation studies we use the same four data sets as Sha et al. [2007]: Gene I, Gene II,
Data I and Data II (Tables I, IV and V in Sha et al. [2007]), and the same three similarity measures:
(1) the matching measure: score 1 for complete match and 0 otherwise; (2) the length measure:
length spanned by the longest continuous interval of matching alleles; and (3) the counting measure:
the proportion of alleles in common. We also explore the performance of our approximations
using seven different sample sizes: n = m = (20, 50, 100, 500, 1000, 5000, 10000).

Simulation studies based on the test statistic $D_s$

We examine the performance of our approximations under both the null and the alternative 
hypotheses. 

Examining the distribution of $D_s$ under the null hypothesis

Under the null hypothesis, we first examine the qq-plots of our 2-cum and 4-cum approximations
for moderate sample size: n = m = 100 (Figure 1). The x-axes are the quantiles of
$D_s$, which are estimated based on 1.6 million independent simulations according to the true
parameter values. The y-axes are the theoretical quantiles of our approximations based on the true
parameter values. The range of the quantiles is from 0.00001 to 0.99999. For Data I and Data II,
the frequencies in the control population are used. From Figure 1, we observe that most of the
points are around the straight line y = x, which leads to the conclusion that both the 2-cum and
4-cum approximations are very good in general, even when there are rare haplotypes (Gene II,
for example) and the sample size is moderate (n = m = 100). Notice that at the left tails of
these plots, the 4-cum approximation goes above the straight line y = x. However, this does not
affect the performance of our approximations for p-values since only the right tail is of interest.
At the right tails, the 2-cum approximations are all below the straight line, which indicates that
the 2-cum approximation tends to underestimate the p-values. This is further verified in Table 2
below. The 4-cum approximation appears to perform better than the 2-cum. We also checked the
qq-plots as the sample size increased. As expected, our approximations become better with larger
sample sizes (results not shown here).
[Figure 1 about here] 

The qq-plot can only show the comparison visually. It is also necessary to assess
our approximations quantitatively. In this paper, we chose two natural distances between
any two distribution functions: the Kolmogorov distance (K-dist) and the Cramér-von Mises distance
(CM-dist). For more distance choices, see Kohl and Ruckdeschel [2009]. In general, the Kolmogorov
distance measures the maximum difference between two distribution functions, while
the Cramér-von Mises distance measures the average difference throughout the support of x (see
Appendix C for more details). We calculate the K-dist and CM-dist between our approximate
distributions and the empirical ones based on 10K simulations under the null hypothesis for each
combination of data set (4 in total), measure (3 in total) and sample size (7 in total). Notice that
we did not use 1.6 million simulations here because it is computationally too intensive, especially
when the sample size is large. In practice, we do not know the true values of $p$ and $q$. Therefore,
the variance matrix $\Sigma_s$ is replaced by a consistent estimate $\hat{\Sigma}_s$, which will affect the accuracy of
our approximations to some extent. To account for the uncertainty when using $\hat{\Sigma}_s$, we simulate 20
samples and obtain an approximate distribution for each sample.
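For readers who wish to reproduce this kind of comparison, the Kolmogorov distance between an empirical distribution and a candidate distribution function can be computed as sketched below. This uses the standard supremum definition; the exact definitions used in this paper are those of Appendix C and may differ in detail. The function name is hypothetical.

```python
def kolmogorov_distance(sample, cdf):
    """Maximum absolute difference between the empirical distribution function
    of `sample` and the distribution function `cdf` (a callable)."""
    xs = sorted(sample)
    n = len(xs)
    dist = 0.0
    for i, x in enumerate(xs, start=1):
        f = cdf(x)
        # The empirical CDF jumps from (i - 1)/n to i/n at x, so check both sides.
        dist = max(dist, i / n - f, f - (i - 1) / n)
    return dist

# Toy example against the uniform distribution on [0, 1].
print(kolmogorov_distance([0.25, 0.5, 0.75], lambda x: x))
```

In the setting of this section, `sample` would hold simulated values of the statistic and `cdf` would be the approximating (scaled and shifted) chi-square distribution function.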

We compare the performance of the 2-cum approximation, the 4-cum approximation and the 
permutation procedure for different choices of sample sizes. We first use the true parameter values
$p\,(= q)$ for the approximations (Table 1, rows "true"). Then we simulate 20 independent samples
and replace $p\,(= q)$ and $\Sigma_s$ with $\hat{p}$ and $\hat{\Sigma}_s(\hat{p})$ (see Appendix E for definitions) respectively. The
empirical distribution based on 1,000 permutations is also calculated for each of the 20 samples. 
Since the permutation procedure can be very slow when the sample sizes n and m are large, we 
did not perform permutations when n = m > 1000. For each method, the mean and standard 
deviation of distances based on these 20 samples are displayed in Table 1, rows "mean" and "s.d.". 
To simplify the output, we show only the results for Gene I using the matching measure. 
[Table 1 here] 

From Table 1, we observe that for the 2-cum and 4-cum approximations, the mean distances 
using estimated parameter values converge to the distance using the true parameter values as the
sample sizes n and m increase. This is because both the asymptotic and the approximate components
contribute to the distance. When sample sizes increase, the discrepancy due to the asymptotic
component decreases eventually to zero, whereas the discrepancy due to the approximate component
does not. For example, the K-dist for the 4-cum method based on true parameter values
decreases from 0.0630 to 0.0482 when the sample size increases from 20 to 50. But when the 
sample size increases from 50 to 10,000, it seems that this distance stays constant around 0.046. 
The 4-cum approximation appears better than the 2-cum one if one cares about the average difference
(CM-dist). Nevertheless, the opposite may be true when the maximum difference (K-dist) is
preferred. Compared with the permutation procedure, the proposed approximations show better 
performance for n as small as 20, and comparable performance when n is reasonably large. Note 
that our methods can be hundreds of times faster than permutations. 

The conclusions regarding the convergence of the mean distances and the performance of 
permutations are similar when using the other data sets and measures. Therefore, in Table 2, we 
consider the distances based on true parameter values only. Moreover, since the main contributor 
to the distances is approximation when sample sizes are around 100, we use only the results from 
the case when n = m = 100 in Table 2.
[Table 2 about here] 

From Table 2, we conclude that the 4-cum approximation performs better than the 2-cum approximation
on average when sample sizes are moderate (around 100 individual haplotypes in each
sample). However, there are some situations when the 2-cum approximation is preferred, such as
those in the rows "Gene I", "Data II" and the column "Counting" under "K-dist" in Table 2. To find
out how much of the distance is due to the discrete empirical distribution of $D_s$, we also checked
the distance between the approximate distributions and their own empirical distributions based on
10K independent observations. The Kolmogorov distance is around 0.87% and the Cramér-von
Mises distance is around 0.38%, which are about 20% of the distances in Table 2. This indicates
that when the predefined significance value is moderate, such as 0.05, and the sample sizes are 
moderate, such as 100, both the 2-cum and the 4-cum approximations are appropriate. 

In addition to their general performance, we would also like to know how good the approximations
are when the significance level is very small. Ideally one should compare the approximations
with true probabilities. However, since the theoretical distribution of $D_s$ is unknown, the only
way to estimate the true probabilities is through simulations. When the true value of the
probability is small, for example $1 \times 10^{-5}$, we need 1.6 million simulations to ensure
that the estimate is between p/2 and 3p/2 with 95% confidence. Here we consider moderate sample
size n = m = 100. We estimate the critical values for significance levels
$\alpha$ = (0.05, 0.01, 0.001, 0.0001, 0.00001) using the empirical distribution function of $D_s$
based on 1.6 million independent observations. For each combination of data set and similarity
measure, we then estimate the corresponding significance levels using three methods: the 2-cum
chi-square approximation, the 4-cum chi-square approximation and a permutation procedure based on
160K permutations. Since under the null hypothesis we need the sample proportions $\hat{p}$ and
$\hat{q}$ for the approximation, which will confound the effect of approximation with random
errors, we examine the approximations based on both the true parameter values and the estimated
ones from 20 simulations. It takes about 6 hours on a standard computer with Intel(R) Core(TM)
CPU @ 2.66 GHz and 3.00 GB of RAM to estimate p-values using permutations for these four data
sets, three measures and 20 simulations. However, only two seconds are needed using our
approximations. Moreover, when the sample size increases, the computational time increases
rapidly for a permutation procedure, while it stays the same for our approximations.

[Table 3 about here]
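
The simulation count quoted above can be checked with a short calculation. The sketch below is not from the paper; it assumes the usual normal approximation to a binomial proportion, and the function name `simulations_needed` is hypothetical. It asks how many independent draws are needed so that the estimate of a tail probability p stays within p/2 and 3p/2 with 95% confidence.

```python
import math
from statistics import NormalDist


def simulations_needed(p, rel_halfwidth=0.5, confidence=0.95):
    """Draws needed so that the Monte Carlo estimate of p lies within
    p * (1 +/- rel_halfwidth) with the given confidence.

    Assumption: normal approximation to the binomial proportion, i.e.
    z * sqrt(p (1 - p) / N) <= rel_halfwidth * p.
    """
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    return math.ceil(z * z * (1.0 - p) / (p * rel_halfwidth ** 2))


# Smallest significance level considered in the text.
n_required = simulations_needed(1e-5)
print(n_required)
```

Under these assumptions the result is a little over 1.5 million, which is in line with the 1.6 million simulations used in the text.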

The results for Data II using the matching measure are summarized in Table 3. From this
table, we can see that the 2-cum approximation performs slightly better than the 4-cum one when
estimating a p-value around 0.05, while the 4-cum approximation is more accurate at p-values
less than 0.01. This indicates that for a candidate gene study with a significance level of 0.05,
the 2-cum approximation is preferred since it is simpler and more accurate. However, for a genome
screen, the 4-cum approximation would be more appropriate. Notice that the 4-cum approximation
is accurate in estimating a p-value as small as 0.1%. For probabilities around 0.01%, the
4-cum approximation tends to slightly under-estimate the true value and will therefore result in
more false positives. For the probabilities around 0.001%, we list results in the last column
of Table 3. However, since the number of simulations is limited, we can have only modest
confidence in these approximations, although it is evident that they will provide an
under-estimate of probabilities. Note that the permutation procedure gives good estimates for a
p-value as small as 0.01% due to the large number of permutations. However, in the last column of
Table 3, we notice that the standard deviation of the estimated p-values is 0.001%, which is about
the same as the mean (0.0012%) of these estimates. This is because 160K permutations are far too
few to give an accurate estimate of a p-value of 0.001%. The conclusions based on the other data
sets are similar (results not shown).
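
To see why the permutation estimate becomes unreliable at the smallest level, one can look at the binomial standard error of a permutation p-value. The sketch below is illustrative only. It reads the permutation count as 160,000 (an assumption about the figure quoted in the text), and it isolates the Monte Carlo error, ignoring the sample-to-sample variation that also enters the standard deviations in Table 3.

```python
import math


def permutation_pvalue_se(p, n_perm):
    """Binomial standard error of a p-value estimated as the fraction of
    n_perm permutations whose statistic exceeds the observed one."""
    return math.sqrt(p * (1.0 - p) / n_perm)


N_PERM = 160_000  # assumption: permutation count read as 160K

for p in (5e-2, 1e-2, 1e-3, 1e-4, 1e-5):
    se = permutation_pvalue_se(p, N_PERM)
    print(f"p = {p:.0e}: s.e. = {se:.2e}, relative s.e. = {se / p:.2f}")

relative_se_smallest = permutation_pvalue_se(1e-5, N_PERM) / 1e-5
```

At p = 0.001% the relative standard error is close to 0.8 under this assumption, so a standard deviation of about the same size as the mean is what one would expect.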

Examining the distribution of $D_s$ under the alternative hypothesis

Similarly, we can examine the distribution of $D_s$ under the alternative hypothesis. For this
purpose, we used Data I and Data II based on 160K simulations with sample sizes n = m = 100.
The range of the quantiles is from 0.0001 to 0.9999. Note that only the 4-cum approximation is
available under the alternative hypothesis. From Figure 2, we observe that all the points lie
close to a straight line, which indicates good approximations to the distribution of $D_s$ under
the alternative hypothesis.

[Figure 2 about here] 

Next, we examine the Kolmogorov and Cramer-von Mises distances between our approximations
and the true distribution of $D_s$, which is estimated by the empirical distribution based on
10K simulations. The effect of sample size is similar to what was observed under the null
hypothesis, so we consider only the case when n = m = 100. Moreover, in this situation, we usually
apply the formula to calculate power, in which case the true values of p and q are assumed to be
known. From Table 4, we notice that the distances are all less than 0.05. Therefore, it is safe to
use the 4-cum approximation to find the power of $D_s$.

[Table 4 about here] 

Similarly, we examine the performance of the 4-cum approximation in the left tail, which
is useful in a power analysis. In this situation, we assume that the parameter values are known.
The quantiles at (0.50, 0.60, 0.70, 0.80, 0.90, 0.95, 0.99) are estimated through 160K simulations.
Table 5 summarizes the results when n = m = 20, n = m = 100 and n = m = 1000.
From this table, we conclude that the power estimation is fairly accurate with moderate sample
size (n = m = 100) and moderate true power (less than 95%).

[Table 5 about here]

Simulations to check the distribution of the statistic $D_t$

Tzeng et al. [2003] claimed that under the null hypothesis, the distribution of
$D_t = \hat{p}^T A \hat{p} - \hat{q}^T A \hat{q}$ is approximately normal with mean $E(D_t)$ and
variance $\mathrm{Var}(D_t)$. This is true sometimes, but not always. In fact, if the convergence
rates of $\hat{p}^T A \hat{p}$ and $\hat{q}^T A \hat{q}$ differ, the normal approximation will not
be appropriate. This will occur in three situations. First, if there are several rare alleles,
as in Gene I and Data II, $\hat{p}$ and $\hat{q}$ can differ substantially even under the null
hypothesis (results not shown). Second, when the sample sizes n and m are not equal, the variances
of $\hat{p}^T A \hat{p}$ and $\hat{q}^T A \hat{q}$ will differ, and therefore so will the
convergence rates (Figure 3). Third, under the alternative hypothesis, the convergence rates of
$\hat{p}^T A \hat{p}$ and $\hat{q}^T A \hat{q}$ differ. Therefore, the normal approximation
is not suitable in the above three situations. As an illustration, we use data set Data II and the
matching measure to examine the qq-plot. The range of the quantiles is from 0.0001 to 0.9999. We
first let n = 50 and m = 150 and then let n = 1000 and m = 3000 (Figure 3). From Figure 3, we
can see that our 4-cum chi-square approximation approximates the distribution of $D_t$ very well
even when the smaller sample size is as small as 50. If the smaller sample size increases to 1000,
the normal approximation also becomes acceptable.

[Figure 3 about here]

To further compare the normal with the 4-cum chi-square approximation, we calculate the
Kolmogorov and Cramer-von Mises distances for different combinations of data sets, measures
and sample sizes. We assume that the size m of the second sample is three times the size n of
the first sample (m = 3n). For illustration purposes, we show the results for Data II only
(Table 6). From Table 6, we observe that the chi-square approximation has much smaller distances
than the normal one, especially when sample sizes are not very large. The conclusions for the
other data sets are similar.

[Table 6 about here]

An example based on the estimation of power for a candidate gene study

In this example we test the difference between haplotype distributions around the LCT gene
(23 SNPs) found in populations HapMap3 CHB (n = 160) and HapMap3 JPT (m = 164). Since
the linkage phase information is unknown, an EM algorithm was used to estimate the frequency of
each distinct haplotype category. Under the matching and length measures, the p-values of the test
are both less than $10^{-8}$, which indicates a significant difference in haplotype distributions.
However, these two similarity measures are very sensitive to errors due to genotyping or
estimation and the results are therefore not reliable, especially in the case of unknown phase.
Using the counting measure, the p-value is 0.026. It would then be interesting to know how many
additional samples are required if we want power to be, say, 90% at a significance level of 0.001,
using the test statistic $D_s$ and the counting measure. Using the approximations described in our
Methods section, we can easily calculate the required sample size. The quantities needed here are
haplotype lists, frequencies and variance estimates for each population separately and jointly,
which can be estimated using the EM algorithm. We first use the package haplo.stats [Sinnwell and
Schaid, 2008] in R to find the starting value. Then we use a stochastic EM to refine the estimate
and obtain the variance. The results are shown in Table 7. Note that all these calculations take
only minutes on a standard computer with Intel(R) Core(TM) CPU @ 2.66 GHz and 3.00 GB of RAM.
However, it requires at least several days to finish a single calculation using a permutation
procedure.

[Table 7 about here]

Discussion 

In summary, the major contribution of the analytic approach presented in this paper is the
description of the asymptotic and approximate distributions of a large class of quadratic form
statistics used in multilocus association tests, as well as efficient ways to calculate the p-value
and power of a test. Specifically, we have shown that the asymptotic distribution of the quadratic
form $\hat{s}^T A \hat{s}$ is a linear combination of chi-square distributions. In this situation,
$\hat{s}$ asymptotically follows a multivariate normal distribution which may be degenerate.

To efficiently calculate the p-value under the null hypothesis $s = E(\hat{s}) = 0$, we propose
2-cum and 4-cum chi-square approximations to the distribution of $\hat{s}^T A \hat{s}$. We extended
the 4-cum approximation in Liu et al. [2009] to allow degenerate $\hat{s}$ and general symmetric A
which may not be positive semi-definite. Generally speaking, the 4-cum approximation is better
than the 2-cum approximation when dealing with probabilities less than 0.01. Nevertheless, the
latter may perform better for moderate probabilities, say 0.05. On the other hand, the 2-cum method
only involves products of up to two $k \times k$ matrices, while the 4-cum approach relies on a
product of four $k \times k$ matrices. When the number of haplotypes k is large, the 2-cum approach
is computationally much less intensive. To estimate the power of a test, however, only the 4-cum
approximation is valid.

The similarity matrix A can be singular or approximately singular due to missing values. In
this case, we decompose A and perform dimension reduction to get a smaller but nonsingular
similarity matrix. The most attractive feature of our method is that we do not need to decompose
the matrices $\Sigma_s$ or W when A is positive semi-definite because the decompositions do not
appear in the final formula. This not only simplifies the formula, but also results in better
computational properties since it is often hard to estimate $\Sigma_s$ accurately.

In this paper we do not consider the effect of latent population structure. It has been widely
recognized that the presence of undetected population structure can lead to a higher false positive
error rate or to decreased power of association testing [Marchini et al. 2004]. Several statistical
methods have been developed to adjust for population structure [Devlin and Roeder 1999, Pritchard
and Rosenberg 1999, Pritchard et al. 2000, Reich and Goldstein 2001, Bacanu et al. 2002, Price et
al. 2006]. These methods mainly focus on the effect of population stratification on the
Cochran-Armitage chi-square test statistic. It would be interesting to know how these methods can
be applied to similarity- or distance-based statistics to conduct association studies in the
presence of population structure.

Our methods can potentially be applied to genome-wide association studies because the
computations are fast and small probabilities can be estimated with acceptable variation. To
perform a genome screen, one must currently define the regions of interest manually, which will be
exceedingly tedious. Due to limitations of length, we do not discuss the problem of how to define
haplotype regions automatically. Clearly, before this approach can be applied in practice, such
methods and software will have to be developed. We propose to explore this issue in the future.

A: Proof that $D_s$ can be written as a linear combination of independent chi-square random
variables under the alternative hypothesis

Start with the corresponding equation in the main text and $W = B^T A B = V \Omega V^T$. Then

$$Z^T B^T A B Z = Z^T W Z = Z^T V \cdot \Omega \cdot V^T Z = Y^T \Omega Y$$

$$s^T A B Z = s^T A B V \Omega^{-1} \cdot \Omega \cdot V^T Z = b^T \Omega Y$$

where $Y = V^T Z \sim N(0, I_{r_\sigma})$ and $b^T = s^T A B V \Omega^{-1}$. Let
$c = s^T A s - b^T \Omega b$. We have

$$\begin{aligned}
D_s &\approx Z^T B^T A B Z + 2 s^T A B Z + s^T A s \\
    &= Y^T \Omega Y + 2 b^T \Omega Y + s^T A s \\
    &= (Y + b)^T \Omega (Y + b) + s^T A s - b^T \Omega b \\
    &= \sum_{i=1}^{r_\sigma} \omega_i (Y_i + b_i)^2 + c
\end{aligned}$$

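
The last two steps of the derivation in Appendix A amount to completing the square coordinate by coordinate. The check below is a minimal numerical sketch of that identity; the eigenvalues, the shift vector and the value standing in for $s^T A s$ are hypothetical numbers chosen only for illustration.

```python
import random


def quad_expanded(omega, b, y, sAs):
    """Y' Omega Y + 2 b' Omega Y + s'As, with diagonal Omega given as a list."""
    return sum(w * yi * yi + 2.0 * w * bi * yi
               for w, bi, yi in zip(omega, b, y)) + sAs


def quad_completed(omega, b, y, sAs):
    """sum_i omega_i (Y_i + b_i)^2 + c, with c = s'As - b' Omega b."""
    c = sAs - sum(w * bi * bi for w, bi in zip(omega, b))
    return sum(w * (yi + bi) ** 2 for w, bi, yi in zip(omega, b, y)) + c


rng = random.Random(0)
omega = [2.0, 0.7, 0.3]   # hypothetical eigenvalues of W
b = [0.4, -1.1, 0.25]     # hypothetical shift vector
sAs = 1.5                 # hypothetical value of s'As
y = [rng.gauss(0.0, 1.0) for _ in omega]

lhs = quad_expanded(omega, b, y, sAs)
rhs = quad_completed(omega, b, y, sAs)
print(lhs, rhs)
```

Both expressions agree up to floating-point rounding for any draw of Y, which is all the final equality in the proof requires.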
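B: Four-cumulant non-central chi-square approximation

Rewrite the original statistic $D_s = \hat{s}^T A \hat{s}$ into its asymptotic form
$(Y + b)^T \Omega (Y + b) + c$ (see Appendix A). We only need to consider the shifted quadratic form

$$Q(Y_b) = Y_b^T \Omega Y_b + c$$

(see the corresponding equation in the main text), where $Y_b = Y + b \sim N(b, I_{r_\sigma})$ and
$\Omega = \mathrm{diag}(\omega_1, \ldots, \omega_{r_\sigma})$ with
$\omega_1 \ge \omega_2 \ge \cdots \ge \omega_{r_\sigma} > 0$.

According to Liu et al. [2009], the $\nu$th cumulant of $Q(Y_b)$ is

$$\kappa_\nu = 2^{\nu-1} (\nu-1)! \, (\kappa_{\nu,1} + \nu \kappa_{\nu,2})$$

In our case, for $\nu = 1, 2, 3, 4$,

$$\kappa_{\nu,1} = \mathrm{tr}(\Omega^\nu) = \mathrm{tr}((V \Omega V^T)^\nu) = \mathrm{tr}(W^\nu)
= \mathrm{tr}((B^T A B)^\nu) = \mathrm{tr}((A \Sigma_s)^\nu)$$

For $\nu = 1$,

$$\kappa_{1,2} = b^T \Omega b + c = b^T \Omega b + s^T A s - b^T \Omega b = s^T A s$$

For $\nu = 2, 3, 4$,

$$\begin{aligned}
\kappa_{\nu,2} &= b^T \Omega^\nu b \\
 &= s^T A U_\sigma (\Lambda_\sigma)^{1/2} V \Omega^{-1} \cdot \Omega^\nu \cdot \Omega^{-1} V^T (\Lambda_\sigma)^{1/2} U_\sigma^T A s \\
 &= s^T A U_\sigma (\Lambda_\sigma)^{1/2} V \Omega^{\nu-2} V^T (\Lambda_\sigma)^{1/2} U_\sigma^T A s \\
 &= s^T A B (V \Omega V^T)^{\nu-2} B^T A s \\
 &= s^T A B (B^T A B)^{\nu-2} B^T A s \\
 &= s^T (A \Sigma_s)^{\nu-1} A s
\end{aligned}$$

Therefore,

$$\kappa_\nu = 2^{\nu-1} (\nu-1)! \left( \mathrm{tr}((A \Sigma_s)^\nu) + \nu \, s^T (A \Sigma_s)^{\nu-1} A s \right), \quad \nu = 1, 2, 3, 4$$

which actually takes the same form as in Liu et al. [2009]. So the discussion here extends Liu et
al. [2009]'s formulas to a more general quadratic form which allows a degenerate multivariate
normal distribution.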
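
As an illustration of how the cumulant formula can be used, the sketch below computes the bracketed terms $c_\nu = \mathrm{tr}((A\Sigma_s)^\nu) + \nu s^T (A\Sigma_s)^{\nu-1} A s$ and then matches a single non-central chi-square to them. The matching rule is written in the form usually attributed to Liu et al. [2009] and should be checked against that paper; it assumes a positive third cumulant, so the difference-of-chi-squares case for an indefinite A is not covered. All function names are hypothetical.

```python
import math


def matmul(X, Y):
    """Plain-Python matrix product for lists of lists."""
    return [[sum(X[i][l] * Y[l][j] for l in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]


def trace(X):
    return sum(X[i][i] for i in range(len(X)))


def cumulant_terms(A, Sigma, s):
    """Return [c_1, ..., c_4] with
    c_v = tr((A Sigma)^v) + v * s'(A Sigma)^(v-1) A s,
    so that kappa_v = 2^(v-1) (v-1)! c_v."""
    k = len(s)
    AS = matmul(A, Sigma)
    As = [sum(A[i][j] * s[j] for j in range(k)) for i in range(k)]
    power = [[float(i == j) for j in range(k)] for i in range(k)]  # (A Sigma)^0
    terms = []
    for v in range(1, 5):
        vec = [sum(power[i][j] * As[j] for j in range(k)) for i in range(k)]
        shift = sum(s[i] * vec[i] for i in range(k))  # s'(A Sigma)^(v-1) A s
        power = matmul(power, AS)                     # now (A Sigma)^v
        terms.append(trace(power) + v * shift)
    return terms


def match_noncentral_chisq(c):
    """Match degrees of freedom, non-centrality and scale a to c_1..c_4
    (assumed form of the Liu et al. [2009] rule; requires c_3 > 0)."""
    c1, c2, c3, c4 = c
    s1 = c3 / c2 ** 1.5
    s2 = c4 / c2 ** 2
    if s1 * s1 > s2:
        a = 1.0 / (s1 - math.sqrt(s1 * s1 - s2))
        ncp = s1 * a ** 3 - a * a
        df = a * a - 2.0 * ncp
    else:
        a = 1.0 / s1
        ncp = 0.0
        df = 1.0 / (s1 * s1)
    return df, ncp, a


# Sanity check on a case with a known answer: A = Sigma = I_3 and s = (1, 0, 0)
# make the quadratic form an exact non-central chi-square with 3 df and ncp 1.
I3 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
terms = cumulant_terms(I3, I3, [1.0, 0.0, 0.0])
df, ncp, a = match_noncentral_chisq(terms)
print(terms, df, ncp, a)
```

With the matched parameters, the tail probability at a threshold t would be evaluated as the survival function of the matched non-central chi-square at $t^* \sqrt{2}\,a + df + ncp$, where $t^* = (t - c_1)/\sqrt{2 c_2}$. That step needs a non-central chi-square routine such as `scipy.stats.ncx2.sf`, which is left out here to keep the sketch free of third-party dependencies.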

C: Distance between a continuous distribution and an empirical distribution

To compare one continuous cumulative distribution function $F_1$ and one empirical distribution
$F_2$ (or discrete distribution), two natural distances are the Kolmogorov distance

$$d_K(F_1, F_2) = \sup_x |F_1(x) - F_2(x)|$$

and the Cramer-von Mises distance with measure $\mu = F_1$

$$d_{CV}(F_1, F_2) = \left( \int [F_1(x) - F_2(x)]^2 \, dF_1(x) \right)^{1/2}$$

Note that $F_2$ is piecewise constant. Let $x_1, x_2, \ldots, x_n$ be all the distinct
discontinuity points of $F_2$, kept in increasing order. If $F_2$ is an empirical distribution,
$x_1, x_2, \ldots, x_n$ are the distinct values of the random sample which generates $F_2$. Write
$x_0 = -\infty$.

For the Kolmogorov distance, the maximum can be obtained by checking all the discontinuity
points of $F_2$. Therefore,

$$d_K(F_1, F_2) = \max_i \{ |F_1(x_i) - F_2(x_i)| \} \vee \max_i \{ |F_1(x_i) - F_2(x_{i-1})| \}$$

For the Cramer-von Mises distance,

$$\begin{aligned}
d_{CV}^2(F_1, F_2) &= \int_{-\infty}^{x_1} F_1(x)^2 \, dF_1(x) + \int_{x_n}^{\infty} [1 - F_1(x)]^2 \, dF_1(x)
 + \sum_{i=1}^{n-1} \int_{x_i}^{x_{i+1}} [F_1(x) - F_2(x_i)]^2 \, dF_1(x) \\
 &= \frac{1}{3} F_1(x_1)^3 + \frac{1}{3} [1 - F_1(x_n)]^3
 + \frac{1}{3} \sum_{i=1}^{n-1} \left( [F_1(x_{i+1}) - F_2(x_i)]^3 - [F_1(x_i) - F_2(x_i)]^3 \right)
\end{aligned}$$

Note that the formulas above work better than the corresponding R functions in the package
"distrEx" (downloadable via http://cran.r-project.org/). Those R functions have difficulties with
large sample sizes (say n > 2000), because their calculation relies on grids on the real line.
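
The two distances in Appendix C can be computed exactly from the jump points of the empirical distribution, without any grid. The sketch below is one possible implementation, not the authors' code. It takes the Cramer-von Mises distance to be the square root of $\int [F_1 - F_2]^2 dF_1$ and uses the fact that $\int (F_1 - c)^2 dF_1$ between two jump points equals the difference of $(F_1 - c)^3 / 3$ at the endpoints. The function name and the uniform example are hypothetical.

```python
import math


def distances_to_empirical(F1, sample):
    """Kolmogorov and Cramer-von Mises distances between a continuous CDF F1
    (a callable) and the empirical CDF F2 of `sample`."""
    n = len(sample)
    counts = {}
    for v in sample:
        counts[v] = counts.get(v, 0) + 1
    xs = sorted(counts)                 # distinct jump points x_1 < ... < x_m
    F2, cum = [], 0
    for x in xs:
        cum += counts[x]
        F2.append(cum / n)              # F2(x_i), right-continuous value
    F2_prev = [0.0] + F2[:-1]           # F2(x_{i-1}), with F2(x_0) = 0
    f1 = [F1(x) for x in xs]

    # Kolmogorov: check both sides of every jump.
    d_k = max(max(abs(u - v) for u, v in zip(f1, F2)),
              max(abs(u - v) for u, v in zip(f1, F2_prev)))

    # Cramer-von Mises: closed-form integral on each constant piece of F2.
    cv_sq = f1[0] ** 3 / 3.0 + (1.0 - f1[-1]) ** 3 / 3.0
    for i in range(len(xs) - 1):
        cv_sq += ((f1[i + 1] - F2[i]) ** 3 - (f1[i] - F2[i]) ** 3) / 3.0
    return d_k, math.sqrt(cv_sq)


def uniform_cdf(x):
    """CDF of the uniform distribution on [0, 1] (illustrative choice of F1)."""
    return min(max(x, 0.0), 1.0)


d_k, d_cv = distances_to_empirical(uniform_cdf, [0.25, 0.75])
print(d_k, d_cv)
```

The cost is one evaluation of $F_1$ per distinct sample value, so the calculation scales linearly with the sample size after sorting.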
D: Calculating the distribution of the difference between two non-central chi-squares

Let $Y_1$ and $Y_2$ be two independent non-central chi-square random variables with probability
density functions $f_1(y)$ and $f_2(y)$ respectively. Write $Z = Y_1 - Y_2$. Then the probability
density function $f(z)$ of Z can be calculated through

$$f(z) = \int_{-\infty}^{\infty} f_1(z + y) f_2(y) \, dy
= \int_0^1 f_1\!\left(z + \log\frac{x}{1-x}\right) f_2\!\left(\log\frac{x}{1-x}\right) \cdot \frac{1}{x(1-x)} \, dx$$

The cumulative distribution function $F(z)$ of Z can be calculated through

$$\begin{aligned}
F(z) &= \int_{-\infty}^{\infty} \int_{-\infty}^{z} f_1(y_1 + y_2) f_2(y_2) \, dy_1 \, dy_2 \\
 &= \int_0^1 \int_0^{\frac{e^z}{1+e^z}} f_1\!\left(\log\frac{x_1 x_2}{(1-x_1)(1-x_2)}\right)
    f_2\!\left(\log\frac{x_2}{1-x_2}\right) \cdot \frac{1}{x_1 x_2 (1-x_1)(1-x_2)} \, dx_1 \, dx_2
\end{aligned}$$

Note that we perform the transformation $y = \log(x/(1-x))$ in both formulas to convert the
integration interval from $(-\infty, \infty)$ into (0, 1) for numerical integration purposes.
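
The density formula in Appendix D can be tried out on a case with a known answer. The sketch below is illustrative only: instead of non-central chi-square densities it uses two central chi-squares with 2 degrees of freedom as stand-ins, because their difference has the closed-form density $\frac{1}{4} e^{-|z|/2}$. The midpoint rule and the grid size are arbitrary choices, and the function names are hypothetical.

```python
import math


def chi2_df2_pdf(y):
    """Density of a central chi-square with 2 degrees of freedom
    (a stand-in for the non-central densities f_1 and f_2)."""
    return 0.5 * math.exp(-0.5 * y) if y > 0.0 else 0.0


def difference_pdf(z, f1, f2, n_grid=2000):
    """Midpoint-rule evaluation of f(z) = int f1(z + y) f2(y) dy after the
    substitution y = log(x / (1 - x)), which maps the real line to (0, 1)."""
    h = 1.0 / n_grid
    total = 0.0
    for i in range(n_grid):
        x = (i + 0.5) * h
        y = math.log(x / (1.0 - x))
        total += f1(z + y) * f2(y) / (x * (1.0 - x))
    return total * h


for z in (-1.0, 0.0, 1.0):
    approx = difference_pdf(z, chi2_df2_pdf, chi2_df2_pdf)
    exact = 0.25 * math.exp(-0.5 * abs(z))
    print(f"z = {z:+.1f}: numerical = {approx:.6f}, closed form = {exact:.6f}")
```

For genuine non-central chi-square densities one would pass the appropriate density functions as `f1` and `f2`; the transformation and the quadrature stay the same.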
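E: Simplified formulas for tr(W) and tr(W^2) when phase is known

Let $\tilde{p} = (\tilde{p}_1, \ldots, \tilde{p}_k)$, where
$\tilde{p}_i = (n \hat{p}_i + m \hat{q}_i) / (n + m)$, $i = 1, \ldots, k$. Then under the null
hypothesis, $\tilde{p}_i$ is a consistent estimate of $p_i$ ($= q_i$). It follows that
$\hat{\Sigma}_s = \Sigma_s(\tilde{p}) = (1/n + 1/m)(\tilde{R} - \tilde{p}\tilde{p}^T)$
is a consistent estimate of $\Sigma_s$, where
$\tilde{R} = \mathrm{diag}(\tilde{p}_1, \ldots, \tilde{p}_k)$. Since $\tilde{R}$ is a diagonal
matrix and $\tilde{p}$ is a vector, the calculation of tr(W) and tr(W^2) can be further
simplified as

$$\mathrm{tr}(W) = \left(\frac{1}{n} + \frac{1}{m}\right)
\left( \sum_{j=1}^{k} a_{jj} \tilde{p}_j (1 - \tilde{p}_j)
 - 2 \sum_{j_1=1}^{k} \sum_{j_2 > j_1} a_{j_1 j_2} \tilde{p}_{j_1} \tilde{p}_{j_2} \right)$$

$$\begin{aligned}
\mathrm{tr}(W^2) = \left(\frac{1}{n} + \frac{1}{m}\right)^2 \Bigg[
 & \sum_{j=1}^{k} a_{jj}^2 \tilde{p}_j^2 (1 - 2\tilde{p}_j) \\
 & + 2 \sum_{j_1=1}^{k} \sum_{j_2 > j_1} a_{j_1 j_2}^2 \tilde{p}_{j_1} \tilde{p}_{j_2} (1 - \tilde{p}_{j_1} - \tilde{p}_{j_2}) \\
 & - 4 \sum_{j_1=1}^{k} \sum_{j_2 > j_1} \sum_{l=1}^{k} \tilde{p}_{j_1} \tilde{p}_{j_2} a_{l j_1} a_{l j_2} \tilde{p}_l \\
 & + \left( \sum_{i=1}^{k} a_{ii} \tilde{p}_i^2 + 2 \sum_{j_1=1}^{k} \sum_{j_2 > j_1} a_{j_1 j_2} \tilde{p}_{j_1} \tilde{p}_{j_2} \right)^2 \Bigg]
\end{aligned}$$

It is important to point out that the degrees of freedom
$df = \mathrm{tr}(W)^2 / \mathrm{tr}(W^2)$ do not depend on the sample sizes n and m according to
the above formulas.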
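
The remark about the degrees of freedom can be checked numerically. The sketch below assumes $\Sigma_s = (1/n + 1/m)(R - pp^T)$ with $R = \mathrm{diag}(p)$ and a symmetric similarity matrix A, and uses $\mathrm{tr}(W^\nu) = \mathrm{tr}((A\Sigma_s)^\nu)$ from Appendix B. It computes both traces by explicit matrix products and by element-wise sums obtained from expanding those products, then compares the resulting degrees of freedom for two sample-size pairs. The matrix A, the frequency vector p and all function names are hypothetical.

```python
def matmul(X, Y):
    """Plain-Python product of two square matrices (lists of lists)."""
    k = len(X)
    return [[sum(X[i][l] * Y[l][j] for l in range(k)) for j in range(k)]
            for i in range(k)]


def trace(X):
    return sum(X[i][i] for i in range(len(X)))


def sigma_s(p, n, m):
    """(1/n + 1/m) (R - p p^T) with R = diag(p)."""
    c = 1.0 / n + 1.0 / m
    k = len(p)
    return [[c * ((p[i] if i == j else 0.0) - p[i] * p[j]) for j in range(k)]
            for i in range(k)]


def traces_direct(A, p, n, m):
    """tr(A Sigma_s) and tr((A Sigma_s)^2) by explicit matrix products."""
    AS = matmul(A, sigma_s(p, n, m))
    return trace(AS), trace(matmul(AS, AS))


def traces_closed_form(A, p, n, m):
    """The same two traces from element-wise sums (A must be symmetric)."""
    c = 1.0 / n + 1.0 / m
    k = len(p)
    pairs = [(j1, j2) for j1 in range(k) for j2 in range(j1 + 1, k)]
    pAp = (sum(A[i][i] * p[i] ** 2 for i in range(k))
           + 2.0 * sum(A[j1][j2] * p[j1] * p[j2] for j1, j2 in pairs))
    tr1 = c * (sum(A[j][j] * p[j] for j in range(k)) - pAp)
    tr2 = c * c * (
        sum(A[j][j] ** 2 * p[j] ** 2 * (1.0 - 2.0 * p[j]) for j in range(k))
        + 2.0 * sum(A[j1][j2] ** 2 * p[j1] * p[j2] * (1.0 - p[j1] - p[j2])
                    for j1, j2 in pairs)
        - 4.0 * sum(p[j1] * p[j2] * A[l][j1] * A[l][j2] * p[l]
                    for j1, j2 in pairs for l in range(k))
        + pAp ** 2)
    return tr1, tr2


# Hypothetical similarity matrix and haplotype frequencies.
A = [[1.0, 0.5, 0.0], [0.5, 1.0, 0.5], [0.0, 0.5, 1.0]]
p = [0.5, 0.3, 0.2]

t1_direct, t2_direct = traces_direct(A, p, 100, 100)
t1_closed, t2_closed = traces_closed_form(A, p, 100, 100)
df_equal = t1_direct ** 2 / t2_direct

t1_u, t2_u = traces_direct(A, p, 50, 150)
df_unequal = t1_u ** 2 / t2_u
print(df_equal, df_unequal)
```

The factor (1/n + 1/m) enters tr(W) once and tr(W^2) twice, so it cancels in the ratio, which is why the two printed values coincide.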
Figures



Figure 1: The qq-plots of the 2-cum (red) and 4-cum (blue) approximations to the
distribution of $D_s$ (based on 1.6 million simulations) under the null hypothesis using
Gene I (first row), Gene II (second row), Data I (third row) and Data II (fourth row). The
black solid line is y = x. We use the true values of p and q here. The left, middle,
and right columns are for the matching, length, and counting measures respectively. The
sample sizes are m = n = 100.

Figure 2: The qq-plots of the 4-cum (blue) approximations to the distribution of $D_s$
(based on 160K simulations) under the alternative hypothesis using Data I (first row)
and Data II (second row). The black solid line is y = x. We use the true values of p
and q here. The left, middle, and right columns are for the matching, length, and counting
measures respectively. The sample sizes are m = n = 100.

Figure 3: The qq-plots of the 4-cum chi-square approximation (blue "4") and the
normal approximation (red "n") to the distribution of $D_t$ under the null hypothesis
using Gene II and the matching measure. We use the true values of p and q here. The
left plot has the smaller sample sizes n = 50 and m = 150. The right plot has the larger
sample sizes n = 1000 and m = 3000.
Tables

TABLE 1. Kolmogorov and Cramer-von Mises distances (%) under the null
hypothesis for Gene I using the matching measure

                                        sample size (n = m)
Distance  Method            20      50     100     500    1000    5000   10000
K-dist    2-cum  true     5.72    4.95    4.71    4.13    3.77    4.23    3.69
                 mean     8.69    7.55    5.68    4.21    4.00    4.23    3.70
                 s.d.     2.71    2.90    1.55    0.56    0.45    0.21    0.14
          4-cum  true     6.30    4.82    4.54    4.65    4.70    4.51    4.75
                 mean     8.76    6.81    4.80    4.57    4.61    4.52    4.77
                 s.d.     3.43    3.37    1.11    0.48    0.34    0.14    0.09
          perm.  mean    10.39    6.74    4.16    3.00      NA      NA      NA
                 s.d.     3.16    2.89    1.15    1.18      NA      NA      NA
CM-dist   2-cum  true     2.25    2.35    2.05    2.21    2.00    2.31    2.02
                 mean     4.18    3.81    2.63    2.30    2.08    2.31    2.02
                 s.d.     1.71    1.67    0.68    0.11    0.14    0.04    0.02
          4-cum  true     1.98    1.47    1.24    1.20    1.38    1.52    1.31
                 mean     4.15    3.38    2.10    1.35    1.54    1.53    1.32
                 s.d.     2.23    2.24    1.03    0.26    0.23    0.10    0.05
          perm.  mean     4.32    3.21    1.96    1.29      NA      NA      NA
                 s.d.     2.27    1.91    0.71    0.70      NA      NA      NA



TABLE 2. Kolmogorov and Cramer-von Mises distances under the null hypothesis 
when sample sizes n = m = 100 
1.08


2.46 


2.73 



TABLE 3. Comparison of probabilities in the right tail for Data II using
the matching measure when n = m = 100.

                                       P (%)
Data     Method              5        1      0.1     0.01    0.001
Data II  2-cum  true    4.9724   0.7977   0.0483   0.0024   0.0002
                mean    5.0302   0.8134   0.0503   0.0027   0.0002
                s.d.    0.1619   0.0733   0.0102   0.0009   0.0001
         4-cum  true    5.1828   1.0273   0.0929   0.0076   0.0008
                mean    5.2266   1.0297   0.0926   0.0076   0.0008
                s.d.    0.1331   0.0753   0.0161   0.0022   0.0003
         perm.  mean    5.0482   0.9976   0.1011   0.0104   0.0012
                s.d.    0.1602   0.0771   0.0238   0.0033   0.0010



TABLE 4. Kolmogorov and Cramer-von Mises distances under the alternative
hypothesis when n = m = 100 (4-cum only)

                      K-dist                           CV-dist
Data      Matching    Length  Counting    Matching    Length  Counting
Data I      0.0076    0.0132    0.0176      0.0028    0.0055    0.0069
Data II     0.0133    0.0312    0.0401      0.0045    0.0101    0.0065



TABLE 5. Comparison of probabilities in the left tail (4-cum only)

                   Sample                       Power (%)
Data     Measure     Size      50      60      70      80      90      95      99
Data II  Matching      20   48.59   56.45   65.41   80.95   92.40   98.01  100.00
                      100   50.10   59.63   69.71   79.27   89.64   95.62  100.00
                     1000   50.17   60.10   70.17   79.89   90.00   95.00   99.01
         Length        20   48.17   57.12   67.11   78.42   96.29   99.63   99.95
                      100   50.00   59.73   69.26   78.91   89.48   96.80   99.91
                     1000   50.13   60.22   70.13   80.01   89.97   95.01   99.06
         Counting      20   48.41   58.54   67.59   79.45   96.01  100.00  100.00
                      100   49.92   59.79   69.54   79.19   90.00   97.12  100.00
                     1000   49.92   59.92   69.92   79.95   90.05   94.99   99.01

TABLE 6: Comparison of distances for (4-cum) chi-square and normal approximations

                                        sample size n (m = 3n)
Measure   Distance  Method       20      50     100     500    1000    5000
Matching  K-dist    Chi-sq   0.0288  0.0187  0.0116  0.0047  0.0077  0.0068
                    Normal   0.2030  0.1325  0.0915  0.0408  0.0324  0.0144
          CM-dist   Chi-sq   0.0154  0.0096  0.0059  0.0022  0.0028  0.0025
                    Normal   0.1494  0.1021  0.0694  0.0314  0.0237  0.0085
Length    K-dist    Chi-sq   0.0269  0.0163  0.0054  0.0072  0.0074  0.0093
                    Normal   0.1779  0.1160  0.0805  0.0365  0.0267  0.0099
          CM-dist   Chi-sq   0.0127  0.0079  0.0020  0.0027  0.0035  0.0035
                    Normal   0.1191  0.0805  0.0541  0.0248  0.0147  0.0054
Counting  K-dist    Chi-sq   0.0246  0.0174  0.0090  0.0078  0.0087  0.0070
                    Normal   0.1721  0.1112  0.0757  0.0333  0.0233  0.0127
          CM-dist   Chi-sq   0.0122  0.0085  0.0036  0.0029  0.0040  0.0033
                    Normal   0.1089  0.0694  0.0456  0.0208  0.0161  0.0084



TABLE 7: Sample sizes required given significance level and power

                                  Power (%)
                         70            80            90
Significance (%)     CHB    JPT    CHB    JPT    CHB    JPT
1                    181    186    203    208    234    240
0.1                  275    282    302    309    339    348
0.01                 366    375    395    405    438    449
0.001                435    446    467    479    513    526

